Crime in Philadelphia has received relatively little analysis, so we set out to provide an in-depth look at the city's crime statistics. This tutorial analyzes the crime data provided at https://www.kaggle.com/mchirico/philadelphiacrimedata to better understand criminal activity in Philadelphia. It has three sections: data collection and cleaning, visualization and exploratory analysis, and a linear regression model that builds on the analysis.
!pip install folium
import pandas as pd
import numpy as np
import folium
import matplotlib.pyplot as plt
import sklearn as skl
from sklearn import linear_model
import os
from folium import IFrame
from folium.plugins import MarkerCluster
import seaborn as sb
import matplotlib.dates as mdates
import datetime
import warnings
warnings.filterwarnings("ignore")
The CSV file that serves as our data source covers crime in Philadelphia from 2006 to 2017. The data comes from Kaggle and is supplied in the crime.csv file. We read the CSV file using the methods introduced in class and load it into a pandas DataFrame.
# Date Parser for the data
dateparse = lambda d: datetime.datetime.strptime(d,'%Y-%m-%d %H:%M:%S')
# Load the data into a dataframe.
df = pd.read_csv("crime.csv",
header=0,names=['Dc_Dist', 'Psa', 'Dispatch_Date_Time', 'Dispatch_Date',
'Dispatch_Time', 'Hour', 'Dc_Key', 'Location_Block', 'UCR_General',
'Crime_Type', 'Police_Districts', 'Month', 'Longitude',
'Latitude'],dtype={'Dc_Dist':str,'Psa':str,
'Dispatch_Date_Time':str,'Dispatch_Date':str,'Dispatch_Time':str,
'Hour':str,'Dc_Key':str,'Location_Block':str,
'UCR_General':str,'Crime_Type':str,
'Police_Districts':str,'Month':str,'Longitude':str,'Latitude':str},
parse_dates=['Dispatch_Date_Time'],date_parser=dateparse)
# Fix Month to datetime Month
df['Month'] = df['Month'].apply(lambda x: datetime.datetime.strptime(x,'%Y-%m'))
df.head()
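The `date_parser` argument used above is deprecated in pandas 2.0 and later. As a minimal sketch, assuming the same timestamp layout as crime.csv, the columns can instead be converted after loading with `pd.to_datetime` and an explicit `format`. The two sample rows and the name `sample` are made up for illustration.

```python
import io
import pandas as pd

# Two made-up rows that mimic the timestamp layout of crime.csv (illustrative only)
sample_csv = io.StringIO(
    "Dispatch_Date_Time,Month\n"
    "2009-10-02 14:24:00,2009-10\n"
    "2016-02-29 23:05:00,2016-02\n"
)
sample = pd.read_csv(sample_csv, dtype=str)

# Convert after loading, with an explicit format for each column
sample["Dispatch_Date_Time"] = pd.to_datetime(
    sample["Dispatch_Date_Time"], format="%Y-%m-%d %H:%M:%S"
)
sample["Month"] = pd.to_datetime(sample["Month"], format="%Y-%m")

print(sample.dtypes)
```

Passing an explicit format keeps the parsing strict: a row that does not match the layout raises an error instead of being guessed at.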
As in any data science project, we have to devote time to cleaning the data of unneeded information and to watching for missing data along the way.
# First drop Location_Block, Dc_Key, Dc_Dist, Dispatch_Date_Time and Hour columns
df = df.drop('Location_Block', axis=1)
df = df.drop('Dc_Key', axis=1)
df = df.drop('Dc_Dist', axis=1)
df = df.drop('Dispatch_Date_Time', axis=1)
df = df.drop('Hour', axis=1)
# Then add Year and Day columns, copy the existing Month column to Crime_Date, and replace Month with the month number
df["Year"] = pd.DatetimeIndex(df["Month"]).year
df["Day"] = pd.DatetimeIndex(df["Dispatch_Date"]).day
df["Crime_Date"] = df["Month"]
df["Month"] = pd.DatetimeIndex(df["Crime_Date"]).month
# The dataset is very large, so we drop every row with a NaN value; this should introduce little bias
df2 = df.dropna()
df2 = df2[df2.Year != 2017]
df2.index = range(len(df2.index))
df2.head()
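Whether dropping rows with missing values leaves the data unbiased can be checked rather than assumed. The sketch below is one possible check on a made-up frame (the name `toy` and all values are illustrative, not taken from crime.csv): it measures the share of rows removed and compares the crime-type mix before and after cleaning.

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for df (values are illustrative, not real data)
toy = pd.DataFrame({
    "Crime_Type": ["Thefts", "Thefts", "Rape", "Arson", "Thefts", "Fraud"],
    "Latitude": [39.95, np.nan, 39.99, 40.01, 39.97, np.nan],
    "Longitude": [-75.16, np.nan, -75.20, -75.10, -75.15, -75.18],
})

cleaned = toy.dropna()

# Share of rows lost to dropna
dropped_share = 1 - len(cleaned) / len(toy)

# Crime-type proportions before and after cleaning, side by side
mix = pd.DataFrame({
    "before": toy["Crime_Type"].value_counts(normalize=True),
    "after": cleaned["Crime_Type"].value_counts(normalize=True),
}).fillna(0)

print(f"dropped {dropped_share:.1%} of rows")
print(mix)
```

If the two columns of `mix` are close and the dropped share is small, the cleaning step is unlikely to distort the later counts.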
Once we have cleaned the data and removed the unnecessary columns, we are ready for analysis. We visualize the data by several attributes, such as month, crime type, and year, and include maps to give a better sense of where crime occurs. We then provide statistical analysis so the results can be compared with similar projects on other cities.
The bar graph below shows the number of crimes per year from 2006 to 2016 (2017 was dropped during cleaning). The overall number of crimes decreases over this period.
sb.catplot(x="Year", kind="count", height=6, aspect=2, data=df2)
plt.xlabel("Year", fontsize=16)
plt.ylabel("Total Crime Number", fontsize=14)
plt.title("Number of Crimes Committed per Year", fontsize=16)
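To put a number on the yearly trend, the year-over-year percentage change can be computed from the yearly counts. This is a minimal sketch with made-up counts (the series `yearly` is hypothetical and does not reproduce the notebook's figures).

```python
import pandas as pd

# Made-up yearly crime counts (illustrative only)
yearly = pd.Series({2006: 1000, 2007: 950, 2008: 900}, name="Count")

# Percentage change relative to the previous year; the first year has no predecessor
pct_change = yearly.pct_change() * 100

print(pct_change.round(2))
```

With the real data, the same call could be applied to `df2["Year"].value_counts().sort_index()`.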
This graph groups crime counts by month. There is no linear trend: crime tends to be higher in the middle of the year and lower at the beginning and end. The data again covers 2006 to 2016.
sb.catplot(x='Month', kind='count', height=6, aspect=2, data=df2)
plt.xlabel("Month", fontsize=16)
plt.ylabel("Total Crime Number", fontsize=14)
plt.title("Number of Crimes Committed per Month", fontsize=16)
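Months differ in length, so raw monthly counts slightly favor 31-day months. One way to compare months on equal footing is to divide each count by the number of days in the month. The sketch below uses made-up counts and a single non-leap year chosen only for the example.

```python
import calendar
import pandas as pd

# Made-up monthly counts for four months (illustrative only)
counts = pd.Series({1: 3100, 2: 2800, 3: 3100, 4: 3300}, name="Count")

# Days in each month of 2015, a non-leap year chosen for the example
days = pd.Series({m: calendar.monthrange(2015, m)[1] for m in counts.index})

# Average number of crimes per day within each month
per_day = counts / days

print(per_day)
```

In this toy data January and February have different totals but the same daily rate, which the raw counts would hide.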
The graph below shows the frequency of each type of crime. Theft and assault are among the most common forms of crime, while arson, rape, homicide, and public alcohol consumption are the least common. The graph helps us see how the different forms of crime compare in frequency. There are 33 different types of crime.
sb.catplot(y='Crime_Type', kind ='count', height=8, aspect=2, order=df2.Crime_Type.value_counts().index,
data=df2)
plt.xlabel("Number of Crimes", fontsize=16)
plt.ylabel("Type of Crime", fontsize=14)
plt.title("Number of Times a Specific Type of Crime was Committed", fontsize=16)
To dig deeper, we look at how crime counts vary across police districts. District 22 appears to have the fewest crimes and district 11 the most.
sb.catplot(x='Police_Districts', kind='count', height=6, aspect=2, order=df2.Police_Districts.value_counts().index,
data=df2)
plt.xlabel("Police District", fontsize=16)
plt.ylabel("Number of Crimes", fontsize=14)
plt.title("Number of Crimes Committed per Police District", fontsize=16)
Can we predict anything about crime in Philadelphia based on the day?
We may also learn something about crime in Philly by relating it to the day of the year. Crime appears to be lowest around holidays. Each holiday is marked on the graph, and there is a sudden drop in crime around each of them, which is surprising.
# Get the average number of crimes committed on each calendar day (366 possible days)
data_by_day_and_month = df.groupby(["Month", "Day"]).size() / 11
# Only 3 leap years occurred in the time period, so February 29 is divided by 3 instead of 11.
# .loc with a (month, day) tuple writes to the Series itself rather than to a temporary copy.
data_by_day_and_month.loc[(2, 29)] = (data_by_day_and_month.loc[(2, 29)] * 11) / 3
# Make sure that the xticks in the following graph are always on the first of the month
leap_year = [31,29,31,30,31,30,31,31,30,31,30,31]
ticks = []
n = 0
for d in leap_year:
    ticks.append(n)
    n += d
plot = data_by_day_and_month.plot(figsize=(16,7), xticks=ticks, color="blue")
plot.set(xlabel="(Month, Day)", ylabel="Average Number of Crimes")
plot.set_title("Numbers of Crimes per day in Philadelphia", fontweight="bold", fontsize=16)
# Valentine's Day
plt.arrow(15,435,20,20, width=1, color="black", head_starts_at_zero=False)
plt.text(15,420, "Valentine's Day")
# Independence Day
plt.arrow(158,435,20,20, width=1, color="black", head_starts_at_zero=False)
plt.text(134,420, "Independence Day")
# Thanksgiving
plt.arrow(295,420,20,20, width=1, color="black", head_starts_at_zero=False)
plt.text(275,402, "Thanksgiving Week")
# Christmas
plt.arrow(330,220,20,0, width=1, color="black", head_starts_at_zero=False)
plt.text(305,218, "Christmas")
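The per-day average above divides by 11 years and then corrects February 29 by hand. A more general sketch, under the assumption that every calendar date has at least one record in each year it occurs, divides each (month, day) count by the number of distinct years in which that date appears. The frame `toy` and its `date` column are made up for illustration.

```python
import pandas as pd

# One made-up "crime" per calendar day from 2006 through 2016 (illustrative only)
dates = pd.date_range("2006-01-01", "2016-12-31", freq="D")
toy = pd.DataFrame({"date": dates})
toy["Year"] = toy["date"].dt.year
toy["Month"] = toy["date"].dt.month
toy["Day"] = toy["date"].dt.day

grouped = toy.groupby(["Month", "Day"])
total = grouped.size()                   # records per calendar date, all years pooled
years_seen = grouped["Year"].nunique()   # number of years in which each date occurs
average = total / years_seen             # average records per occurrence of the date

print(years_seen.loc[(2, 29)], years_seen.loc[(1, 1)])
```

Because the divisor is derived from the data, the leap-day case needs no special handling and the code keeps working if the year range changes.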
This section uses maps of Philadelphia to show where in the city most crimes are concentrated.
We have included three maps: a scatter map of crimes in 2016, a heatmap of violent crimes from 2011 to 2016, and a heatmap of thefts from 2015 to 2016.
The maps cover different year ranges mainly because folium fails to render maps once the dataset exceeds a certain size.
Below is a scatter map of crimes that occurred in 2016. Black represents violent or dangerous crimes, blue stands for theft, and cyan represents all other (petty) crimes.
The map suggests that the more dangerous a crime is, the less common it is.
The map draws on a random sample of 42,000 records from the cleaned data, of which the ones from 2016 are plotted. A random sample keeps the picture representative while staying small enough not to slow down rendering; we were mindful of the bias that smaller samples could introduce.
from folium.plugins import HeatMap
from folium import plugins
from folium import FeatureGroup
from folium import IFrame
from folium.plugins import MarkerCluster
from random import randint
dangerous = ['Weapon Violations', 'Robbery Firearm', 'Homicide - Criminal', 'Aggravated Assault Firearm',
'Homicide - Gross Negligence', 'Homicide - Justifiable', 'Rape']
theft = ['Thefts', 'Theft from Vehicle', 'Motor Vehicle Theft', 'Receiving Stolen Property',
'Recovered Stolen Motor Vehicle']
map_osm = folium.Map(location=[39.95,-75.16], zoom_start=11)
arrest_loc = FeatureGroup(name="Crime")
# Draw a random sample of 42,000 rows, then keep the rows from 2016
sampled = df2.sample(n=42000)
temp_data = sampled[sampled.Year == 2016]
for i, row in temp_data.iterrows():
    # Black for dangerous crimes, blue for theft, cyan for everything else
    if row['Crime_Type'] in dangerous:
        color = 'black'
    elif row['Crime_Type'] in theft:
        color = 'blue'
    else:
        color = 'cyan'
    location = [float(row['Latitude']), float(row['Longitude'])]
    arrest_loc.add_child(folium.Circle(radius=30, location=location, color=color, fill=True))
map_osm.add_child(arrest_loc)
map_osm.add_child(folium.map.LayerControl())
map_osm
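The color rule used for the markers can be isolated in a small function, which makes it easy to test apart from the map. This is a sketch only: `crime_color` is a hypothetical helper, not part of folium, and the two lists are copied from the cell above.

```python
# Category lists copied from the notebook
dangerous = ['Weapon Violations', 'Robbery Firearm', 'Homicide - Criminal',
             'Aggravated Assault Firearm', 'Homicide - Gross Negligence',
             'Homicide - Justifiable', 'Rape']
theft = ['Thefts', 'Theft from Vehicle', 'Motor Vehicle Theft',
         'Receiving Stolen Property', 'Recovered Stolen Motor Vehicle']


def crime_color(crime_type):
    """Return the marker color for a crime type (hypothetical helper)."""
    if crime_type in dangerous:
        return 'black'
    if crime_type in theft:
        return 'blue'
    # Everything that is neither dangerous nor theft is drawn in cyan
    return 'cyan'


print(crime_color('Rape'), crime_color('Thefts'), crime_color('Some Other Type'))
```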
This map is a heatmap of violent crimes for the period from 2011 to 2016. Violent crimes here are armed robberies, armed assaults, homicides, rape, and weapon violations. The map shows where dangerous crimes happen most often.
The heatmap is limited to 2011 to 2016 because a larger dataset was too big to load.
map_osm2 = folium.Map(location=[39.95,-75.16], zoom_start=11)
# creating a new dataframe with all the dangerous crimes in it
dangerous_data = df2[df2['Crime_Type'].isin(dangerous)]
# Add data for heatmap
data_heatmap = dangerous_data[dangerous_data.Year > 2010]
# Convert the coordinate columns to a list of [latitude, longitude] float pairs
data_heatmap = data_heatmap[['Latitude','Longitude']].astype(float).values.tolist()
HeatMap(data_heatmap, radius=10).add_to(map_osm2)
map_osm2
The heatmap is used again to display theft. Theft is a broad category that includes thefts, theft from a vehicle, motor vehicle theft, receiving stolen property, and recovered stolen motor vehicles.
Because theft is so common, the full year span used for violent crimes was too large to load, so this map uses a 70% sample of thefts from 2015 and 2016. That is still large enough to show the pattern.
map_osm3 = folium.Map(location=[39.95,-75.16], zoom_start=11)
# creating a new dataframe with all the thefts crimes in it
theft_data = df2[df2['Crime_Type'].isin(theft)]
# Add data for heatmap
data_heatmap = theft_data[theft_data.Year > 2014].sample(frac=0.7)
# Convert the coordinate columns to a list of [latitude, longitude] float pairs
data_heatmap = data_heatmap[['Latitude','Longitude']].astype(float).values.tolist()
HeatMap(data_heatmap, radius=10).add_to(map_osm3)
map_osm3
Regression by Year and crime count.
#Use groupby and count functions to count the number of crimes in each year
data_by_year = df2.copy().groupby(df2['Year'], as_index=True, group_keys=True).count()
#Put indexes into the result table
count_by_year = data_by_year[['UCR_General']].reset_index()
#Instead of UCR_General, Count should be the name of the column
count_by_year = count_by_year.rename(index=str, columns={'UCR_General' : 'Count'})
count_by_year
# regression line for the number of crimes per year.
table = count_by_year
x_d=table['Year'].values
y_d=table['Count'].values
z=np.polyfit(x=x_d,y=y_d,deg=1)
f=np.poly1d(z)
x_n = np.linspace(x_d.min(), x_d.max(), 100)
y_n = f(x_n)
plt.figure(figsize=(15,10))
plt.plot(x_d, y_d,'o',x_n,y_n)
plt.xlabel("year")
plt.ylabel("number of crimes")
plt.title("Crimes in Philadelphia")
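With `deg=1`, `np.polyfit` returns the ordinary least-squares line. Its slope equals the sum of products of the centered x and y values divided by the sum of squared centered x values, and the intercept follows from the means. The sketch below checks this on made-up yearly counts (the arrays `years` and `counts` are illustrative and do not reproduce the notebook's results).

```python
import numpy as np

# Made-up yearly counts (illustrative only, not the notebook's results)
years = np.array([2006, 2007, 2008, 2009, 2010], dtype=float)
counts = np.array([230.0, 220.0, 215.0, 200.0, 195.0])

# Closed-form least-squares estimates
x_bar, y_bar = years.mean(), counts.mean()
slope = ((years - x_bar) * (counts - y_bar)).sum() / ((years - x_bar) ** 2).sum()
intercept = y_bar - slope * x_bar

# np.polyfit returns coefficients from the highest degree to the lowest
fit_slope, fit_intercept = np.polyfit(years, counts, deg=1)

print(slope, fit_slope)
print(intercept, fit_intercept)
```

A negative slope here reads as the average change in the count per year, which is how the regression line in the plot should be interpreted.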
#linear regression 1
from sklearn.linear_model import LinearRegression
import statsmodels.formula.api as sm
count_year = count_by_year[['Year', 'Count']].sort_values(by=['Year'], ascending=True).reset_index(drop=True)
# scikit-learn expects a 2-D array of features and a 1-D array of targets
x_value = count_year[['Year']].values
y_value = count_year['Count'].values
line = LinearRegression().fit(x_value,y_value)
m = line.coef_[0]
b_value = line.intercept_
print ("y = {0}x + {1}".format(m, b_value))
x_data = count_year['Year'].values
y_data = count_year ['Count'].values
minimum = x_data.min()
maximum = x_data.max()
result = sm.ols(formula="Count ~ Year", data=count_year).fit()
print (result.summary())
x1 = np.linspace(minimum, maximum, 100)
y1 = x1*m+b_value
data_by_type = df2.copy().groupby(df2['Crime_Type'], as_index=True, group_keys=True).count()
count_crime_type = data_by_type[['UCR_General']].reset_index()
count_crime_type = count_crime_type.rename(index=str, columns={'UCR_General' : 'Count'})
count_crime_type
# Linear regression 2
# Another regression based on Year and Crime Type
crime_type_year = df2.copy()
crime_type_year = crime_type_year[['Year','Crime_Type']]
#Get the count associated with year and crime type
#Get the count associated with year and crime type, naming the count column directly
crime_type_year = crime_type_year.groupby(['Year','Crime_Type']).size().reset_index(name='Count')
#Fit the second regression
regression2 = sm.ols(formula='Count ~ Year + Crime_Type + Year * Crime_Type', data=crime_type_year).fit()
regression2.summary()
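In the formula above, the term `Year * Crime_Type` gives every crime type its own intercept and its own slope over time, so the fitted lines match what separate straight-line fits per crime type would produce. The sketch below illustrates this with `np.polyfit` on made-up counts (the frame `toy` and its values are hypothetical).

```python
import numpy as np
import pandas as pd

# Made-up counts for two crime types over four years (illustrative only)
toy = pd.DataFrame({
    "Year": [2006, 2007, 2008, 2009] * 2,
    "Crime_Type": ["Thefts"] * 4 + ["Arson"] * 4,
    "Count": [400, 390, 380, 370, 10, 12, 14, 16],
})

# One straight-line fit per crime type; the per-type slope is what the
# Year:Crime_Type interaction lets the pooled model capture
slopes = {
    crime: np.polyfit(group["Year"], group["Count"], deg=1)[0]
    for crime, group in toy.groupby("Crime_Type")
}

print(slopes)
```

A single pooled slope would average these opposite trends together, which is why the interaction term matters when crime types move in different directions.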
It is important to understand how trends in different types of crime change over time and which parts of Philadelphia are most affected by them. This tutorial is an example of how existing crime data can be used to understand these trends and raise awareness in order to keep the city safe.
Our analysis found that the overall number of crimes in Philadelphia is decreasing from year to year. On the other hand, the numbers vary with factors such as month, day, neighborhood, and police district. It is also evident that some types of crime are much more frequent than others.
This tutorial can also inform policy making and resource allocation based on the trends and disparities found in our analysis. For example, the police districts with the highest crime counts could be given higher priority when framing crime-reduction policies. The data can also be used to predict future trends with machine learning techniques.